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1 Introduction 



Most of this book is dedicated to the presentation of models of neuronal networks and of methods devel- 
oped to work with them. Within the frameworks set by these models, the activity of populations of neurons 
is derived based on the intrinsic properties of "simplified" neurons and of their connectivity pattern. The 
experimentalist trying to test these models must start by collecting data for which model predictions can 
be made, which means he or she must record the activity of several neurons at once. If the model does in 
fact predict the compound activity of a local neuronal population (e.g. , a cortical column or hyper-column) 
on a relatively slow time scale (100 msec or more), techniques like intrinsic optical imaging [J ] are per- 
fectly adapted. If the model makes "more precise" predictions on the activity of individual neurons on 
a short time scale (~ msec) then appropriate techniques must be used like fast optical recording with 
voltage sensitive dyes [36] or the more classical extracellular recording technique [22]. 

We will focus in this chapter on the problems associated with the extracellular recording technique which 
is still a very popular investigation method. This popularity is partly due to its relative ease of implemen- 
tation and to its low cost. The author is moreover clearly biased toward this technique being one of its 
users. We hope nevertheless that users of other techniques will learn something from what follows. We 
will explain how to make inferences about values of the parameters of a rather complex model from noisy 
data. The approach we will develop can be adapted to any problem of this type. What will change from 
an experimental context to another is the relevant data generation model and the noise model. That does 
not mean that the adaptation of the method to other contexts is trivial, but it is still doable. 

The data generation model with which we will work in this chapter is qualitatively different from the 
ones which have been considered in the past [21, 34, 31, 30]. The reader unfamilliar with the spike- 
sorting problem and willing to get some background on it can consult with profit Lewicki's review [22]. 
To get an introduction on the actual statistical methods adapted to the models previously considered, we 
recommend the consultation of the statistical literature (eg, [ ]) rather than the spike-sorting one, which 
tends, in our opinion, to suffer from a strong taste for ad-hoc methods. 



2 The problem to solve 

Extracellular recordings are typically a mixture of spike waveforms originating from a generally unknown 
number of neurons to which a background noise is superposed as illustrated on Fig 1. Several features 
can be used to distinguish spikes from different neurons [ ] like the peak amplitude on a single of several 
recording sites (Fig 2), the spike width, a bi - or tri - phasic waveform, etc. 

What do we want? 

• Find the number of neurons contributing to the data. 

• Find the value of a set of parameters characterizing the signal generated by each neuron (e.g. , the 
spike waveform of each neuron on each recording site). 

• Acknowledging the classification ambiguity which can arise from waveform similarity and/or signal 
corruption due to noise, the probability for each neuron to have generated each event (spike) in the 
data set. 

• A method as automatic as possible. 

• A non ad - hoc method to answer the above questions. By non ad - hoc we mean a method based on 
an explicit probabilistic model for data generation. 
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Figure 1: Example of a tetrode recording from the locust (Schistocerca americana) antennal lobe. Left, 
recording setting. The probe, a silicon substrate made of two shanks with 16 iridium deposits (the actual 
recording sites), can be seen on the left side of the picture. Each group of 4 recording sites is called a 
tetrode (there are therefore 4 tetrodes by probe). The recording sites are the bright spots. The width of 
the shanks is 80 fim , the side length of each recording site is 13 \im , the diagonal length of each tetrode is 
50 \xm , the center to center distance between neighboring tetrodes is 150 fim . The structure right beside 
the probe tip is the antennal lobe (the first olfactory relay of the insect), its diameter is approximately 400 
fim . Once the probe has been gently pushed into the antennal lobe such that the lowest two tetrodes are 
roughly 100 fim below the surface one gets on these lowest tetrodes data looking typically as shown on 
the right part of the figure. Right, Is of data from a single tetrode filtered between 300 and 5kHz. 



* » Utr 
I I | 



Figure 2: The last 200 ms of Fig. 1. Considering two pairs of spikes (** and ##) the interest of the tetrodes 
becomes clear. On the first recording site (top) the two spikes of the pair (**) look very similar and it 
would therefore be hard for an analyst having only the top recording site information to say if these two 
spikes originate from the same neuron or from two different ones. If now, one looks at the same spikes 
on the three other recording sites the difference is obvious. The same holds for the two spikes of the pair 
(##). They look similar on sites 3 and 4 but very dissimilar on sites 1 and 2. 
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Figure 3: An example of a multi - electrode recording from a rat cerebellar slice. Left, picture of a slice 
(from a 38 days old rat) with a biocytin filled Purkinje cell. Middle, the tip of a Michigan probe drawn to 
scale. Right, 100 ms of data filtered between 300 and 5K Hz, Scale bar: 10 ms, Stars: a triplet of spike 
coming from a single Purkinje cell. (Delescluse and Pouzat, unpublished) 

3 Two features of single neuron data we would like to include 
in the spike-sorting procedure 

3.1 Spike waveforms from a single neuron are usually not stationary on a 
short time - scale 

3.1.1 An experimental illustration with cerebellar Purkinje cells 

One commonly observes that, when principal cells fire bursts of action potentials, the spike amplitude 
decreases 1 during the burst as illustrated on Fig. 3. 

3.1.2 A phenomenological description by an exponential relaxation 

Following [6] we will use an exponential relaxation to describe the spike waveform dependence upon the 
inter - spike interval: 

a (isi) = p ■ (1 — 8 ■ exp (—A ■ isi)) , (1) 

where p is the vector of maximal amplitudes (or full waveforms) on the different recording sites, 8 G [0, 1], 
A, measured in 1/s, is the inverse of the relaxation time constant. 

3.2 Neurons inter - spike interval probability density functions carry a lot of 
information we would like to exploit 

3.2.1 An experimental illustration from projection neurons in the locust antennal lobe 

It is well known and understood since the squid giant axon study by Hodgkin and Huxley that once a 
neuron (or a piece of it like its axon) has fired an action potential, we must wait "a while" before the next 
action potential can be fired. This delay is dubbed the refractory period and is mainly due to inactivation 
of sodium channels and to strong activation of potassium channels at the end of the spike, meaning 
that we must wait for the potassium channels to de - activate and for sodium channel to de - inactivate. 
Phenomenologically it means that we should observe on the inter - spike interval (ISI) histogram from 
a single neuron a period without spikes (i.e., the ISI histogram should start at zero and stay at zero for 
a finite time). In addition we can often find on ISI histograms some other features like a single mode 2 
a "fast" rise and a "slower" decay as illustrated on Fig 4. The knowledge of the ISI histogram can in 

1 More generally the spike shape changes and basically slows down [1 ]. This is mainly due to sodium channels inactivation. 
2 That's the statistician's terminology for local maximum. 
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Figure 4: An example of ISI pdf estimates for 8 projection neurons simultaneously recorded in the locust 
antennal lobe (Pouzat, Mazor and Laurent, unpublished). 



principle be used to improve spike - sorting because it will induce correlations between the labeling of 
successive spikes. Indeed, if in one way or another we can be sure of the labeling of a given spike to a 
given neuron, we can be sure as well that the probability to have an other spike from the same neuron 
within say the next 10 ms is zero, that this probability is high between 30 and 50 ms and high as well 
between 60 and 100 (you just need to integrate the ISI histogram to see that). 



3.2.2 A phenomenological description by a log - Normal pdf 

We will use a log - Normal approximation for our empirical ISI probability density functions (pdfs): 

1 



n(ISI = isi\S = s,F = f) = 



isi ■ f ■ \/2tt 



■ exp 



1 /In isi — Ins 

2' V 7 



(2) 



where S is a scale parameter (measured in seconds, like the ISI) and F is a shape parameter (dimension- 
less). Fig. 5 shows three log - Normal densities for different values of S and F. The lowest part of the 
figure shows that when we look at the density of the logarithm of the ISI we get a Normal distribution 
which explains the name of this density. 

In the sequel, we will in general use it to designate proper probability density functions (pdf for continuous 
random variables) or probability mass functions (pmf for discrete random variables). By proper we mean 
the integral (or the sum) over the appropriate space is 1. We will (try to) use uppercases to designate 
random variables (e.g., ISI, S, F) and lowercases to designate their realizations (e.g., isi, s, f). We will use 
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Figure 5: A, examples of log - Normal densities, plain: S = 0.025, F = 0.5; dashed: S = 0.05, F = 0.5; 
dot-dashed: S = 0.05, F = 0.75. B, peak normalized densities displayed with a logarithmic abscissa. 

a Bayesian approach, which means that the model parameters (like S and F in Eq. 2) will be considered 
as random variables. 



4 Noise properties 

It's better to start with a reasonably accurate noise model and Fig. 6 illustrates how the noise statistical 
properties can be obtained from actual data. Once the empirical noise auto - and cross - correlation func- 
tions have been obtained, assuming the noise is stationary, the noise covariance matrix (£) is constructed 
as a block - Toeplitz matrix. There are as many blocks as the square of the number of recording sites. The 
first row of each block is the corresponding auto- or cross - correlation function, see [31]. Then, if £ is a 
complete noise description, the pdf of a noise vector N is multivariate Gaussian: 

7r (N = n) = ■ (IT 1 ! * ■ cxp (-\ ■ n T £ _1 , (3) 

(2tt)~ V 2 / 

where D is the dimension of the space used to represent the events (and the noise), stands for the 

determinant of the inverse of S, n T stands for the transpose of n. 



4.1 Noise whitening 

If the noise covariance matrix is known, it is useful to know about a transformation, noise whitening, 
which is easy to perform. It makes the equations easier to write and computations faster. If £ is indeed a 
covariance matrix, that implies: 

v T S~ 1 v> 0,Vv G $ D . (4) 

That is, the Mahalanobis distance (v T Y,~ 1 v) is a proper distance which is stated differently by saying 
that the inverse of E is positive definite. Then one can show (Exercise 1 below) that there exists a unique 
lower triangular matrix A such that AA T = and transform vector v of Eq. 4 into w = A T \. We then 
have: 

w T w = v T E" 1 v. (5) 

That is, we found a new coordinate system in which the Mahalanobis distance is the Euclidean distance. 
We say we perform noise whitening when we go from v to w. 

Assuming the noise is stationary and fully described by its second order statistical properties is not 
enough. One can check the validity of this assumption after whitening an empirical noise sample as 
illustrated in Fig. 7. 
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Figure 6: Procedure used to obtain the second order statistical properties of the recording noise illustrated 
with a tetrode recording from the locust antennal lobe. Putative "events" (B, the sweeps are 3 ms long) and 
"noise" (C) are separated from the complete data (A) after a threshold crossing and template matching 
detection of the events. The noise auto - and cross - correlation functions (D, 3 ms are shown for each 
function, the autocorrelations functions are on the diagonal, the crosscorrelation functions on the upper 
part) are computed from the reconstructed noise traces (C). Adapted from [31]. 




150 190 230 270 310 350 



Figure 7: How good is a noise description based on a multivariate Gaussian pdf? A typical example (from 
locust antennal lobe) where £ turns out to be a reasonable noise description. A sample of 2000 noise events 
{ni, . . . ,n 2 ooo} was randomly selected from the reconstructed noise trace (Fig. 6C), the Mahalanobis 
distance to the origin: nf n l; was computed for each of them and the cumulative distribution of these 
distances (plain) was compared with the theoretical expectation (dashed) a x 2 distribution (with in that 
case 238 degrees of freedom). See [ ;■".] for details. 
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Exercise 1 



Assume that T = is a symmetric positive definite matrix. Show there exists a unique (in fact 2, one 
being obtained from the other by a multiplication by - 1) lower triangular matrix A (the Cholesky factor) 
such that: 

AA T = r. (6) 

In order to do that you will simply get the algorithm which computes the elements of A from the elements 
of r. See Sec. 13.1 for solution. 



5 Probabilistic data generation model 

5.1 Model assumptions 

We will make the following assumptions: 

1. The firing statistics of each neuron is fully described by its time independent inter - spike interval 
density That is, the sequence of spike times from a given neuron is a realization of a homogeneous 
renewal point process [ ]. More specifically, we will assume that the ISI pdf is log - Normal (Eq. 2). 

2. The amplitude of the spikes generated by each neuron depends on the elapsed time since the last 
spike of this neuron. More specifically, we will assume that this dependence is well described by Eq. 
1. 

3. The measured amplitude of the spikes is corrupted by a Gaussian white noise which sums linearly 
with the spikes and is statistically independent of them. That is, we assume that noise whitening 
(Eq. 5) has been performed. 



5.2 Likelihood computation for single neuron data 

With a data set V = {(t , a ) , (ii, ai) , . . . , (t N , &n)}, the likelihood is readily obtained (Fig. 8 ). One must 
first get the isi values: ij = t 3 - tj-\, which implies one has to discard the first spike of the data set 3 to 
get the "effective" data set: V = ai) , . . . , (ijv, a^)}. The likelihood is then 4 : 

N 

L(V | p, 5, A, S, /) = J~[ TT is i (ij | S, /) • TT amplitude (&j | ij, Pi 6, A) , (7) 

3=1 



where -K lsi (ij | s, /) is given by Eq. 2 and: 

T^amvlitude (a 7 " | S, A) = 

(27T) 



TTamphtude (Sij | ij,P,S, A) = ~ 1 _p ■ CXp j - ^ ||aj - P • (1 - S ■ CXp (-A ■ ^))|| 2 \ ■ 



This last equation is obtained by combining Eq. 1 with our third model assumption (Gaussian white 
noise). For convenience we write 7T isi (ij \ s, /) for n ist (ISIj =ij | S = s,F = f) and 

^amplitude (&j I ij, P, S, A) for 7t amplitude (A-j = SLj I ISIj = ij , P = p, A = S, A = A) , where the ISIj are consid- 
ered independent and identically distributed (iid) random variables (conditioned on S and F) as well as 
the Aj (conditioned on ISIj, P, A and A). 

3 To avoid discarding the first spike we can assume periodic boundary conditions, meaning the last spike (N) "precedes" the first 
one (0), we then have: «o = T — tjv + to , where T is the duration of the recording which started at time 0. 

4 Purists from both sides (frequentist and Bayesian) would kill us for writing what follows (Eq. 7)... A frequentist would rather 
write: 

L(p,S,X,s,f;V) 

because for him the likelihood function is a random function (it depends on the sample through V ) and its arguments are the 
model parameters. The Bayesian, probably to show the frequentist that he perfectly understood the meaning of the likelihood 
function, would introduce a new symbol, say h and write: 

h [V | p, 5, A, s, / ) = L (p, S, A, s, f ; V ) . 

But the likelihood is, within a normalizing constant, the probability of the data for given values of the model parameters. We will 
therefore use the heretic notation of Eq. 7. 
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Figure 8: Illustration of the likelihood computation for data from a single neuron recorded on a single 
site. For simplicity we assume that the spike peak amplitude only is used. 



The log - likelihood (C = In L) can be written as a sum of two terms: 

£(D | p,8,\,S,f) = C lsl (V | S, f ) + CampUtude (D | p, S, A ) . 



(9) 



where: 



iV 



£ isi (X> | s,/) = -N ■ In/ - £ ^ ln» 3 - + 



In 



(10) 



and: 



1 N 

CampUtude (D \ P,^,X) = -- 2j - P • (1 - (5 • Cxp (-A ■ ij))!! 2 + Cst. 



J'=l 



(ID 



The term £ isi is a function of the ISI pdf parameters only and the term CampUtude is a function of the 
"amplitude dynamics parameters" only. 

Exercise 2 



Starting with the expression of the log - likelihood given by Eq. 10 , show that the values s and / of s and 
f maximizing the likelihood are given by: 



1 N 

lns = iE ln 



(12) 



and: 



/ = 



N r 



In 



-i 2 



(13) 



Exercise 3 



Remember that in a Bayesian setting your start with a prior density for your model parameters: 7r prior (s, /), 
you observe some data, V , you have an explicit data generation model (i.e., you know how to write the 
likelihood function), LiV \ s, f) and you can therefore write using Bayes rule: 



\ V) = 



L (P \s,f)- TTprior (s, f) 
J uy dudvL(V \U,V) ■ Tr pr i or (u, v) 



(14) 



The "Bayesian game" in simple cases is always the same, you take the numerator of Eq. 14, you get rid of 
everything which does not involve the model parameters (because that will be absorbed in the normalizing 
constant) and you try to recognize in what's left a known pdf like a Normal, a Gamma, a Beta, etc. 



In the following we will assume that our prior density is uniform on a rectangle, that is: 



1 prior 



(«,/) 



1 



~rnax °m%n 



l 

fmax fit 



'%«.*»,/»«] (/)■ 



(15) 
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where l[ x . y ]is the indicator function: 



x,y] 



1, if x < u < y 
0, otherwise 



(16) 



Detailed answers to the following questions can be found in Sec. 13.2. 



Question 1 Using Eq. 8 and Eq. 2 show that the posterior density of In S conditioned on F = f and V 
is a truncated Gaussian with mean: 



1 N 

lni= N S^*'"' 



and un-truncated variance (that is, the variance it would have if it was not truncated): 

2 P 



N 



(17) 



(18) 



And therefore the posterior conditional pdf of the scale parameter is given by: 

1 



7r (s | /, T>) a exp 



1 N 
"2 J 



2 (in i — In s) 



■Jr.. 



(a). 



(19) 



Question 2 Write an algorithm which generates a realization of S according to Eq. 19. 



Question 3 We say that a random variable 6 has an Inverse Gamma distribution and we write: 6 
Inv — Gamma (a, (3) if the pdf of 9 is given by: 



*< e = e > = FR 9 - ( ° + M-£) 



(20) 



Using Eq. 8 and Eq. 2 show that the posterior density of F 2 conditioned on S = s and V is a truncated 
Inverse Gamma with parameters: 

N 1 
a = 1 

2 



and 



1 N 

^=-^(k !r k S ) 2 . 



Therefore the posterior conditional pd/' of the shape parameter is: 



?r(/ I 8,T>) a—^ exp 



lE^Li ( lni 3 - Ins) 21 



1 



fmax frt 



[/mi»,U] 



(/) 



(21) 



Question 4 Assuming that your favorite analysis software (e.g., Scilab 5 or Matlab) or your favorite 
C library (e.g., GSL 6 ) has a Gamma random number generator, write an algorithm which generates a 
realization of F from its posterior density Eq. 21. The pdf of a Gamma random variable O is given by: 



7T (f2 = Uj) 



u 01 - 1 exp (-/? • w) 



(22) 



http : / /www . scilab .org 
6 The Gnu Scientific Library: http: //sources . redhat . com/gsl 
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Figure 9: A simple example of sub-trains extraction. We consider a case with 2 neurons in the model 
and a single recording site. A, Spikes are shown with their labels in the present configuration and their 
occurrence times. The recording goes from to T. Bl, the sub-train attributed to neuron 1 has been 
isolated. The relevant ISIs are shown, as well as the amplitudes of the events. B2, same as Bl for the 
sub-train attributed to neuron 2. 



5.3 Complications with multi - neuron data 
5.3.1 Notations for the multi - neuron model parameters 

In the following we will consider a model with K different neurons, which means we will have to estimate 
the value of a maximal peak amplitude parameter (P), an attenuation parameter (A), the inverse of a 
relaxation time constant (A), a scale parameter (S) and a shape parameter (F) for each of the K neurons 
of our model. We will write 6 the random vector lumping all these individual parameters: 

e = (p 1 ,a u a u s 1 ,f u ...,p k ,a k ,a k ,s k ,f k ). (23) 



5.3.2 Configuration and data augmentation 

When we deal with multi - neuron data we need to formalize our ignorance about the origin of each 
individual spike. In order to do that, assuming we work with a model with K neurons, we will "attach" 
to each spike, j, in our data set (V) a label, l 3 e {1, . . . , A'} whose value corresponds to the number of the 
neuron which generated it. From this view point, a spike - sorting procedure is a method to set the labels 
values. But we don't know, at least until we have estimated the model parameters, what is the value of 
each lj. In such an uncertain situation the best we can do is to consider lj as a realization of a random 
variable Lj taking values in {1, . . . , A"}, then the best we can expect from a spike - sorting procedure is 
the distribution (or the pmf) of Lj for j e {1, . . . , N}. 

Our data generation model will necessarily induce a dependence among the Lj because we take into 
account both the ISI densities and the spike waveform dynamics, we will therefore introduce an other 
random variable, the configuration, C defined as a vector of L 3 that is: 

C ={L 1 ,...,L N ) T . (24) 

The introduction of this random variable C is a very common procedure in the statistical literature where 
it is called data augmentation. The key idea being that we are given an incomplete data set (V), which 
basically does not allow us to write the likelihood easily, and if we knew some extra properties of the 
data (like the configuration), we could write down the likelihood of the augmented (or completed) data in 
a straightforward way 7 . Indeed, with C introduced the likelihood of the augmented data (L (V, c | 9)) is 
obtained in two steps: 

C is also called a latent variable in the statistical literature. Statisticians, moreover, commonly use the symbol Z for our C. We 
have change the convention for reasons which will soon become clear. 
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1. Using the configuration realization, c, we can separate the "complete" data set into sub - trains 
corresponding to the different neurons of the model. This is illustrated with a simple case, K = 2 
and a single recording site on Fig. 9. Here we have: i^i = t 3 — t , ii,jvi-i = — £iv-i > ii,iVi = 
T — t N -i +to, where periodic boundary conditions are assumed and where N± is the number of spikes 
attributed to neuron 1 in c. For neuron 2 we get: i 2 ,i = t 2 — t\ , 12,2 = U — t 2 , 12, n 2 = T — t N -2 + h, 
where N 2 is the number of spikes attributed to neuron 2 in c. Using Eq. 8 we can compute a "sub - 
likelihood" for each sub - train: L lj= i (V, c \ 9), L tj=2 (V, c | 6) . 

2. Because our data generation model does not include interactions between neurons, the "complete" 
likelihood is simply 8 : 

K 

L(V,c\6) = l[L lj = q ('D,c\6). (25) 

q=l 



5.3.3 Posterior density 

Now that we know how to write the likelihood of our augmented data set, we can obtain the posterior 
density of all our unknowns 9 and C applying Bayes rule: 

(ft n I T)\ L{ Pi C \ 9 )- ^yrior (0) (C?R , 
^posterior \V,C \ V) = — , (Zb) 

with the normalizing constant Z given by: 

Z=J2 ! d9L(V,c\8)- iT pnor {9) , (27) 
cec' 

where C is the set of all configurations. 



5.4 Remarks on the use of the posterior 

It should be clear that if we manage to get the posterior: 7r poste? j or (6, c \ V), then we can answer any 
quantitative question about the data. For instance if we want the probability of a specific configuration c 
we simply need to integrate out 9: 

tt(c\V)= I d9 ir posterlor (0,c | V) . (28) 
Je 

If we are interested in the cross - correlogram between to neurons, say 2 and 4, of the model and if we 
formally write Ctoss2,a (c) the function which takes as argument a specific configuration (c) and returns 
the desired cross - correlogram, then the best estimate we can make of this quantity is: 

(Cross 2A ) = J^ d9 

^posterior 

{9, c\V)- Cross2,i (c) , (29) 



where we use the symbol () to designate averaging with respect to a proper pdf. 



5.5 The normalizing constant problem and its solution 
5.5.1 The problem 

If we take a look at our normalizing constant Z (Eq. 27) we see a practical problem emerging. To compute 
Z we need to carry out a rather high dimensional integration (on 9) and a summation over every possible 
configuration in C . That's where the serious problem arises because C contains K N elements! Blankly 
stated, we have no way, in real situations where K varies from 2 to 20 and where N is of the order of 1000, 
to compute Z. 

8 Strictly speaking, this is true only if we can ignore spike waveforms superpositions. 
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Fine, if we can't compute directly Z, let us look for a way around it. First, we introduce an "energy 
function": 

E{6,c\V) = - In [L (V, c | 0) • ir prior (9)} . (30) 

E is defined in such a way that given the data (V ), a value for the model parameters (60 and a configuration 
(c) we can compute it. 

Second, we rewrite our posterior expression as follows: 

expR3E^c | V)] 

^posterior (V , C \ V) = — , (01) 

where (3=1. 

Third, we make an analogy with statistical physics: E is an energy, Z a partition function, [3 is proportional 
to the inverse of the temperature {{kTy 1 ) and ^posterior (9, c | V) is the canonical distribution [29, 19]. 

Fourth, we look at the way physicists cope with the combinatorial explosion in the partition function... 



5.5.2 Its solution 

Physicists are interested in computing expected values {i.e., things that can be measured experimentally) 
like the expected internal energy of a system: 

cec Je Z 

or like the pair - wise correlation function between two molecules in a gas or in a fluid which will give 
rise to equations like Eq. 29. Of course they are very rarely able to compute Z explicitly so the trick they 
found fifty years ago [24] is to estimate quantities like (E) with Monte Carlo integration. 

At this point to make notation lighter we will call state of our "system" the pair (9, c) and we will write 
it as: x = (9,c). We will moreover drop the explicit dependence on V in E. That is, we will write 
E (x) for E(9,c \ V). Now by Monte Carlo integration we mean that we "create" a sequence of states: 
x W,xW,...,xM such that: 

lim E = (E), (33) 

where: 

i M 

is the empirical average of the energy. Showing how to create the sequence of states {x^} and proving 
Eq. 33 will keep us busy in the next section. 



6 Markov Chains 

Before embarking with the details of Monte Carlo integration let's remember the warning of one of its 
specialists, Alan Sokal [35]: 

Monte Carlo is an extremely bad method; it should be used only when all alternative meth- 
ods are worse. 

That being said it seems we are in a situation with no alternative method, so let's go for it. 

Our first problem is that we can't directly (or independently) generate the from the posterior density 
because this density is too complicated. What we will do instead is generate the sequence as a realization 
of a Markov Chain. We will moreover build the transition matrix of the Markov chain such that regardless 
of the initial state x^ , we will have for any bounded function H of x the following limit: 

M 

1™ T7 Y H (x (t) ) = (H). (35) 

t=l 
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In principle we should consider here Markov chains in a (partly) continuous space because x = (8, c) and 
8 is a continuous vector. But we will take the easy solution, which is fundamentally correct anyway, of 
saying that because we use a computer to simulate the Markov chain, the best we can do is approximate 
the continuous space in which 8 lives by a discrete and finite space. We therefore split the study of 
the conditions under which Eq. 35 is verified in the general state space (a product of a discrete and a 
continuous space) into two questions 9 : 

1. Under what conditions is Eq. 35 verified for a Markov chain defined on a discrete and finite state 
space? 

2. What kind of error do we produce by approximating our general state space (a product of a discrete 
and of a continuous state space) by a discrete one? 

We will answer here the first of these questions. For the second we will hope (and check) that our discrete 
approximation is not too bad. 

6.1 Some notations and definitions 

Now that we are dealing with a discrete and finite state space we label all our states x with an integer 
and our state space X can be written as: 

X = {xi, . . . , x u } . (36) 

We can, moreover, make a unique correspondence between a random variable (rv), X, and its probability 
mass function -kx that we can view as a row vector: 

70f = (7rx,l,70f,2, • • • ,TOf,i/) , (37) 

where: 

7T X ,i = Pr(X = Xi ) , Vx, G X. (38) 

Strictly speaking a rv would not be defined on a set like X but on a sample space S which would include 
every possible outcome, that is: all elements of X as well as every possible combination of them (like 
Xi U Xj, that we write Xi + x 3 ifi^j because then x t and Xj are mutually exclusive) and the empty set 
(corresponding to the impossible outcome). It is nevertheless clear that for a finite state space, knowing 
the pmf of a rv is everything one needs to compute the probability of any outcome (e.g., Pr (xi + Xj) = 
nx,i + ttxj )■ We will therefore in the sequel say that rvs are defined on the state space X acknowledging 
it is not the most rigorous way to speak. 

Formal definition of a Markov chain A sequence of random variables X^\ . . . , defined on a 
finite state space X is called a Markov chain if it satisfies the Markov property: 

Pr = y | X® =x,..., X<® = 3) = Pr (x^ = y | X<" = x) . (39) 

If Pr (X( t+1 > = y I X« = x) does not explicitly depend on t the chain is said to be homogeneous. 

An homogeneous Markov chain on a finite state space is clearly fully described by its initial random 
variable X^ and by its transition matrix 10 T(x,y) = Pr = y \ X^' = x). This transition matrix 

must be stochastic which means it must satisfy (Sec 13.3): 

T{x,y) = 1, Vz G X. (40) 

yex 

9 The results which will be presented in the next section have of course equivalents in the general case. The reference which is 
always cited by the serious guys is [25]. 

10 Be careful here, the notations are, for historical reasons, misleading. We write down the matrix: T (x,y) where 
x is the state from which we start and y the state in which we end. When we use the probability transition kernel 
notation Pr (y | x) the starting state is on the right side, the end site is on the left side! In this section where we will 
discuss theoretical properties of Markov chains, we will use the matrix notation which can be recognized by the the 
"," separating the start end end states. When later on, we will use the kernel notation, the 2 states will be separated 
by the "|" (and the end will be on the left side while the start will be on the right). 
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If m is the pmf of a rv defined on X , we can easily show that the row vector n defined as: 

V 

rij = ^2 miT(xi, Xj), 

1=1 

is itself a pmf (i.e., < < 1 , Vi G {1, . . . , v) and J2i=i n i = D an d must therefore correspond to a rv. 
That justifies the following vector notation: 

7!"X(*+i) = TxW^i (41) 

where 7r x(t ) and 7r X (t+i) are the pmf associated to two successive rvs of a Markov chain. 

Stationary probability mass function A pmf m is said to be stationary with respect to the transition 
matrix T if: mT = m. 

Applied to our Markov chain that means that if the pmf of our initial rv (tt^co) ) is stationary with respect 
to T, then the chain does not move (i.e., X (t+1) = X (t) , t > ). 

The notion of stationary pmf introduced we can restate our convergence problem (Eq 35). What we want 
in practice is a Markov transition matrix, T, which admits a single stationary pmf (which would be in our 
case ^posterior) and such that for any tt x v» we have: 

lim TTx(t) = ^posteriori 
t^oo 

which means: 

V x e X , V e > ,3t >0 : \n poste rior (x) - n x w < e. (42) 

It is clear that if we knew explicitly a candidate transition matrix, T, we could test that the posterior 
pmf is stationary and then using eigenvalue decomposition we could check if the largest eigenvalue is 1, 
etc... See for instance: [2, 7, 33, 23, 27, 35, 29]. The problem is that we won't in general be able to write 
down explicitly T. We need therefore conditions we can check in practice and which ensure the above 
convergence. These conditions are irreducibility and aperiodicity . 

Irreducible and aperiodic states (After Liu [23]) A state x G X is said to be irreducible if under 
the transition rule one has nonzero probability of moving from x to any other state and then coming 
back in a finite number of steps. A state x e X is said to be aperiodic if the greatest common divider of 

{t : T* (x, x) > 0} is 1. 

It should not be to hard to see at this point that if one state x G X is aperiodic, then all states in X must 
also be. Also, if one state y G X is aperiodic and the chain is irreducible, then every state in X must be 
aperiodic. 

Exercise 4 

Prove the validity of Eq. 40 for a Markov chain on a finite state space X (Eq. 36). See Sec. 13.3 for a 
solution. 

6.2 The fundamental theorem and the ergodic theorem 
Lemma 

Let (j' ',jW, . . .) be an irreducible and aperiodic Markov chain with state space X = {xi,xn, . . . ,x v } 
and transition matrix T. Then there exists an M < oo such that T n (xi,xj) > for all i, j e and 

n > M. 
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Proof 



(After Haggstrbm [11]) By definition of irreducibility and aperiodicity, there exist an integer N < oo such 
that T n (xi,Xi) > for all i e {1, . . . , u} and all n > N. Fix two states n, Xj e X . By the assumed 
irreducibility, we can find some 7i$j such that T" i j {x^Xj) > 0. Let Mij = N + riij. For any m > Mtj, we 
have: 

Pr (x^ = Xj | = x^j > Pr (x^ m ~ ni ^ = x h X^ = Xj \ X<- Q) = a*) , 

and: 

Pr (x^ m - n ^) = Xi ,X^ = Xj | X® = a*) = Pr (jf ^""^ = x t \ X^ = a*) Pr (x^ = Xj \ X^ = x^\ 

but m — m t j > implies that: Pr (X^ m ~ ni ' : >' = Xi \ X^ = xi) > and our choice of m t j implies that: 
Pr (X^ = Xj | X<® = Xi) > therefore: 

Pr (x^" 1 ' = Xj | = x^j > 0. 

We have shown that T m (xi,Xj) > for all m > Mij. The lemma follows with: 

M = max{Mi,i, Mi, a , • • ■ , M Vl M 2l i, . . . , M„, u } . 



Remark It should be clear that the reciprocal of the lemma is true, that is, if there exists an M < 
such that T n (xi,Xj) > for all i, j G {!,..., u] and n > M then the chain is irreducible and aperiodic. 
A transition matrix which has all its elements strictly positive, or such that one of its powers has all its 
elements strictly positive, will therefore give rise to irreducible and aperiodic Markov chains. 

We can now state and prove our first "fundamental" theorem which shows that irreducible and aperiodic 
Markov chains on finite state spaces enjoy geometric convergence to their unique stationary pmf. 



Fundamental theorem 

If an irreducible and aperiodic homogeneous Markov chain on a finite state space X = {xi,x 2 , ...,x u } 
with a transition matrix T (x, y) has ^stationary as a stationary distribution then regardless of the initial 

pmf tt X ( ) we have: 

Hm lt x W = ^stationary (43) 
t— >oo 

and there exists an 1 > e > such that for all x G X and alH > 1 we have: 

K stationary (x) ~ 7T X « (x)\ < (1 - e) (44) 



Proof 

Adapted from Neal [ ]. We can without loss of generality assume that T (x, y) > , Wx, y G X. If such 
is not the case for our initial transition matrix, we can always use the preceding lemma to find an M 
such that: T" (x, y) = T M (x, y) > and work with the Markov chain whose transition matrix is T' . Then 
the uniqueness of the stationary pmf regardless of the initial pmf will imply that the convergence of the 
"new" Markov chain holds for the initial Markov chain as well. 

So let's define: 

e = mm — > 0, 

x,y G X ^stationary (V) 

where we are assuming that n stationary (y) > 0, My G X (if not we just have to redefine the state space to 
get rid of y and to take out row and column y from T ). Then from the transition matrix definition and 
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the stationarity assumption we have: n stationary (y) = Y, x Stationary (x) T(x,y) ,Wy e X which with the 
above assumption gives: 



1 = 5> 



stationary 



(x) 



T{x,y) 



>e$>. 



stationary (*^) j 



^stationary (y) 

it follows that: e < 1. We will now show by induction that the following equality holds: 

TXW = 1 - (1 - e )* ^stationary (s) + (1 - e) r t (x) , Vx G A", 

where r t is a pm/. 

This equality holds for i = 0, with r = n X {«) ■ Let us assume it is correct for t and see what happens for 
t + 1. We have: 



(45) 



7r x(*+ 1 >(2/) 



^7r xW (x)T(x,y) 

X 

^1- (I- e) t ^^2ir s tationary(x)T(x,y) + (1 - e)* r t (x) T(x, y) 

a; cc 
^stationary 

(y) + (l-e)'^>(x) [T(x,y) 

stationary (y) + 

t7T stationary 

x 

1 - (1 - 77 ■ stationary (y) + ( 1 - <0* X! r * ( X ) ^) ~ e7r siatiormry (j/)] 

a: 

-, /-, \t+ll / \ , / -i it+l\^ / \ T{X,y) — C7T stationary \V) 
l-(l-e) TT stationary (V) + (1 - e) ^ r *W 



1 - (1 - e) t+1 7T stationary (y) + (1 - ?'t+l (j/) 

r(x,y) 



where: 



n+i (y) 



(l-e) 



1 - e 



"stationary (v) 



One can easily check that: r t+i (y) > 0, Vy G A" and that: ^ (y) = 1- That is, r t+ i is a proper pm/ on 
A. 



With Eq 45 we can now show that Eq 43 holds: 



{"^stationary {%) ~ T^M (x)| 



^stati 



,(x) - 1 - (1 - e) IT stationary (x) - (l - e)* r t (x) 



(1 - e) 7r stationar j ; (x) - (1 - e) r t (x) 



= (1 - e) |7r statiolmn/ (x) - r t (x) 



Ergodic theorem 

Let (X(°> fX^ 1 ' , . . .) be an irreducible and aperiodic homogeneous Markov chain on a finite state space 
X = {xi, x 2 , . . . , x„} with a transition matrix T (x, y) and a stationary distribution it stationary Let a be a 
real valued function denned on X and let ajv be the empirical average of a computed from a realization 

(x(°) , x« , . . .) of , X<» , . . .) , that is: o jv = ^ Et=o « (» (t) ) ■ Then we have: 

lim (ajv) = (a) OT8totio „ or!/ , 

N^OC sxaii.ona.Ty 

where: 

^ a )7r staflonarH = X! Stationary (x) a (x) , 

and: 

1 W 
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Proof 



Using Eq. 45 in the proof of the fundamental theorem we can write: 



{o-n) 



1 N 

= ^E E n * m W a ^ 



t=0 xGX 



1 N 

E a ^ n E 7 ^' ( 



ie x 



E < x ) 



x£X 



^stationary 



N N 

o^EC-O + ^E^*) 



t=0 



t=0 



^ ^ ^stationary (x) ^ (^) 

Lie* 



/ 1 W Y 

E a ( X ) huE^ ^* ^ ~~ ^staUonary (x)) J 



where £ corresponds to 1 — e in Eq. 45 and therefore: < C < 1- To prove the ergodic theorem we just need 
to show that the second term of the right hand side of the above equation converges to as N — > oo. For 
that it is enough to show that its modulus goes to 0: 



/ 1 N 

E«(*) hvEc'M*) 

x£X \ t=0 



^stationary 

(*)) 



J < E (jf EC 4 \n{x) 

J x £ X \ t=0 



It stationary (^) | 



1 N 

* Ei«wi^Ec 



We just need to recognize here the geometrical series J2t=o C which converges to when iV — > oo. That 
completes the proof. 



What did we learn in this section? For a given pmf it denned on a finite state space X if we can find 
an irreducible aperiodic Markov matrix T which admits it as its (necessarily only) stationary distribution, 
then regardless of our initial choice of pmf tt defined on X and regardless of the function a : X — > 
, an asymptotically unbiased estimate of (a) = J2 X ex a ( x ) 77 ( x ) can ^ e obtained by simulating the 
Markov chain to get a sequence of states: ( ( N >) and computing the empirical average: a = 



7 The Metropolis - Hastings algorithm and its relatives 

What we need now is a way to construct the transition matrix T and more specifically a method which 
works with an unnormalized version of the stationary density. We first introduce the notion of detailed 
balance. 



Detailed balance definition We say that a pmf it defined on a (finite) state space X and a Markov 
transition matrix T satisfy the detailed balance if: \fx, y G X, tv (x) T (x, y) = it (y) T (y, x). 



Detailed balance theorem If the pmf it defined on a (finite) state space X and a Markov transition 
matrix T satisfy the detailed balance then tt is stationary for T. 



Exercise 5 



Prove the theorem. See Sec. 13.4 for solution. 
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7.1 The Metropolis - Hastings algorithm 



7.1.1 Second fundamental theorem 

Let 7r be a pmf defined on a (finite) state space X and T and G two Markov transition matrices defined on 
X satisfying: 

T(x,y) = A(x,y)G(x,y)ifx^y 

t(x,x) = l- ^2 T ( x >y) 

y G X,y^x 

where G (y, x) = if G (x, y) = and 

a ^ = ^WM)' ifG{x ' y)>0 (46) 

= 1, otherwise 

then 

• 7r and T satisfy the detailed balance condition 

• if G is irreducible and aperiodic so is T 

Exercise 6 

Prove the theorem. 



THE MOST IMPORTANT COMMENT OF THESE LECTURES NOTES 

This theorem is exactly what we were looking for. It tells us how to modify an irreducible 
and aperiodic Markov transition (G) such that a pmf -k of our choice will be the stationary 
distribution and it does that by requiring a knowledge of the desired stationary n only up to a 
normalizing constant, because Eq. 46 involves the ratio of two values of n. 

G is often called the proposal transition and A the acceptance probability. The Metropolis algorithm [24] 
(sometimes called the M (RT) 2 algorithm because of its authors names) is obtained with a symmetrical 
G (i.e., G (x,y) = G (y,x)). The above formulation (in fact a more general one) is due to Hastings [14]. 
The set of techniques which involves the construction of a Markov chain (with the Metropolis - Hastings 
algorithm) to perform Monte Carlo integration is called Dynamic Monte Carlo by physicists [35, 29, 19] 
and Markov Chain Monte Carlo (MCMC) by statisticians [10, 33, 23]. 

7.2 Metropolis - Hastings and Gibbs algorithms for multi - dimensional spaces 

Talking of multi - dimensional spaces when we started by saying we were working with a discrete and 
finite one can seem a bit strange. It is nevertheless useful not to say necessary to keep a trace of the 
multi - dimensionality of our "initial" space (ideally all our model parameters: P fe , A fc , A fc , Sk, F k "live" in 
continuous spaces). If not it would be extremely hard to find our way on the map between the discrete 
approximations of all these continuous spaces and the genuine discrete and finite space on which our 
simulated Markov chain evolves. 

We consider therefore now that our random variables are "vectors" and X® becomes X'*'. We can think 
of it as follows: corresponds to p[ 1 \ the maximal peak amplitude of the first neuron in the model on 

the first recording site after t steps,..., X^ corresponds to P± \ the maximal peak amplitude of the first 
neuron in the model on the fourth recording site (assuming we are using tetrodes), Xg corresponds to 
the parameter of the first neuron, Xg corresponds to the parameter of the first neuron, X^ 
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corresponds to the parameter of the first neuron, Xg corresponds the parameter of the first 
neuron, corresponds to p!f[ the maximal peak amplitude of the second neuron in the model on the 
first recording site, etc. The problem is that it turns out to be very hard to find a transition matrix G acting 
on object like these "new"pm/s: 7r^( t) such that the acceptance probabilities (Eq. 46) are not negligible. 
The way around this difficulty is to build the transition matrix T as a combination of component - wise 
transition matrices like T\ which acts only on the first component of n-^t), T 2 which acts only on the 
second, etc. We just need to make sure that our combination is irreducible, and aperiodic (we assume here 
that we build each individual Tj such that ^posterior is its stationary pmf). A way to do that is to construct 
each Tj such that it is irreducible and aperiodic on its "own" sub - space which is obtained practically by 
building the matrix such that it has a strictly positive probability to produce a move anywhere on its sub 
- space. Then two combinations of these TjS are typically used, the first one being: 

T = wiTi + w 2 T 2 + ... + w m T m , 

where m is the number of components of the random vectors and were the WjS are components of a pmf 
defined on the the set of coordinates {1, ... , m}. It should be clear that if each Tj is irreducible and 
aperiodic on its own sub - space, then T will be irreducible and aperiodic on the "complete" state space. 
Because each Tj is built such that ^posterior is its stationary pmf, -^posterior will be the stationary pmf 
of T as well. The concrete implementation of this scheme would go as follows: at each "sub - step" a 
coordinate j is randomly drawn from w, then a random move is performed using Tj. It is customary to call 
"Monte Carlo step" a sequence of m successive such "sub - steps" (that means that on average each model 
parameter will be "changed" during a Monte Carlo step). 

The second combination is: 

T = T 1 xT 2 x...x T m , (47) 

that is, each model parameter is successively "changed". In the same way as above the irreducibility and 
aperiodicity of the TjS in their sub - spaces will give rise to an irreducible and aperiodic T on the "full" 
parameter space. The main difference is that detailed balance condition which be imposed by construction 
to the pairs Tj, ir posterior is not kept by the pair T, n 'posterior- We only have the stationarity property (which 
is enough to ensure convergence). Of course variations on that scheme can be used like using random 
permutations of the TjS (which would restore the detailed balance condition for the pair T, ir pos terior)- A 
"Monte Carlo step" for those schemes is obtained after the application of the complete series of TjS. See 
[ ] for details. 



7.2.1 An example: the Gibbs sampler for the parameters of the ISI density 

The previous discussion seemed probably a bit abstract for most of the readers. In order to be more 
precise about what we meant we will start by considering the following "simple" situation. Let's assume 
that we are given a sample of 25 isis: T> = {ii, . . . , 125} drawn independently from a log-Normal density 
with parameters: s actU ai, factual- We are asked for a Bayesian estimation of values of s and / (assuming 
flat priors for these parameters like in Exercise 3): 

T^isi, posterior (^) f | ^) Oi -^isi | /) ' ^isi,prior (^j /) > 



where L. lsi is easily obtained from Eq. 10: 



L is i (D I s, f) = exp 



N 



3 3=1 



(48) 



We do not recognize any classical pdf in Eq. 48 and we choose to use a MCMC approach. Following 
the previous discussion we will try to build our transition matrix T as a "product" T s x T F . Where T5 
does only change s and Tp does only change f and where both are irreducible and aperiodic on their 
own sub-space. According to the second fundamental theorem we first need to find proposal transitions: 
G s (snow,s P roposed | /, V) and Gf (fnow, f proposed \ s, V). But questions 1 and 3 of Exercise 3 provide us with 
such proposal transitions. In fact these transitions are a bit special because they do not depend on the 
present value of the parameter we will try to change and because they indeed correspond to the posterior 
conditional density of this parameter. A direct consequence of the latter fact is that the acceptance proba- 
bility (Eq. 46) is 1. An algorithm where the proposal transition for a parameter is the posterior conditional 
of this parameter is called a Gibbs sampler by statisticians and a heat bath algorithm by physicists. 

We therefore end up with the following algorithm (using the results of Exercise 3): 
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1. Chose randomly s (0) G [s min ,s max \ and / (0) G [/,„ m , f max \ . 

2. Given /W draw: 

where 7r ( | f^',T>) is defined by Eq. 19 (remark that s^ t+1 ^ is independent of s'*'), 

3. Given s(* +1 ) draw: 

~ 7r( | S ( t+1 ),I?) 

where tt ( | s ( t+1 ) , V) is defined by Eq. 21. 
Exercise 7 

Simulate 25 isi following a log-Normal density with values of your choice for the pair s actua i, factual- Im- 
plement the algorithm and compare its output with the maximum likelihood based inference. That is 
with the maximal likelihood estimates for s and f (given by Eq. 12 & 13). You should compute as well 
the Hessian of the log-likelihood function at its maximum to get confidence intervals (see the chapter of 
E. Brown). 



7.2.2 Generation of the amplitude parameters of our model 



By "amplitude parameters" we mean here the following parameters: P, A, A. Given a data set from a 
single neuron: V = {(«'i,ai) , . . . , (in, ajv)} (see Sec. 5.2) we now try to perform Bayesian inference on all 
its parameters: P, A, A, S, F. We again split our transition matrix T into parameter specific transitions: 
T = Tpj x Tp 2 x Tp 3 x Tp 4 x Ta X T\ x T s x T F . We have seen in the previous example how to get T s and 
Tp. Following the same line we could try to build a Gibbs sampler for the other parameters as well. The 
problem is that the part of the Likelihood function which depends on the amplitude parameters (obtained 
from Eq. 11): 



Lamp (£> | P, 5, A) = exp 



1 



jv 

0=1 



-p-(l-5-exp(-A-ij))|| 



does not correspond to any know pdf even when considered as a function of a single parameter, say 8. 
The reader can notice that such would not be the case if we had 8 = and if we knew it, see [30]. A 
robust solution to this problem is to use a piece-wise linear approximation of the posterior conditional as 
a proposal transition for each parameter (e.g. , Ga ( | p, A, s, f, V)) and then an acceptance probability as 
defined by Eq. 46. More specifically we can start with 101 regularly spaced "sampling values" of 8: 



8e{6 = 0, Si 



0.01, 



^99 



0.99, 8 W0 = 1}, 



compute 101 values of: 



Lamp iP | P, 8i,\) 



(49) 



and define: 

G A (8\p,X,s,f,V) 



A'. 



A 



t fn I ~ x \\ _■_ Lam P ( V p ' A ) ~ Lam P ( V Pl 6 ± A ) fx x \ 

Lamp \P | P,8 i; A) -| (p — Oi) 

Oi+i - d t 



where A/a ensures that Ga is properly normalized and where 8 G [Si, S i+ i]. The obvious problem with this 
approach is that we need to have a reasonably good piece-wise linear approximation of the corresponding 
posterior conditional pdf in order to get reasonable values for our acceptance probability. That means that 
when we start our Markov chain and we do not have a precise idea of the actual shape of this posterior 
conditional density we have to use a lot of sampling points. We spend therefore a lot of time computing 
terms like Eq. 49. A very significant increase of the algorithm speed can thus be obtained by using a 
first run with a lot of sampling points (say 101), then reposition fewer sampling point (say 13) 11 using 

11 We typically use 13 points because we want to have the 10 quantiles, the 2 extrema (here, S m i n and S m ax) allowed by our priors 
and an extra point corresponding to the smallest value obtained in our sample. The reader can "show" as an exercise the usefulness 
of this extra sample point. The easiest way to see the use of this point is to try without it and to remember that the piece- wise linear 
approximation must be normalized. 
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Figure 10: A, a simple one-dimensional target density (black curve, a Gaussian mixture: 
0.3 Af (0.5, 0.025 2 ) + 0.7 A/" (0.525, 0.015 2 ) ) together with its linear piece-wise approximation (gray curve) 
based on 100 discrete samples (the prior density is supposed to be flat between and 1 and zero every- 
where else). B, A sample of size 1000 is drawn using the piece-wise linear approximation as a proposal 
and a MH acceptance rule. The cumulative distribution of the sample is shown (black) together with the 
location of the 10 quantiles. C, Using the location of the 10 quantiles, the boundaries of the prior and the 
location of the smallest generated value, a sparser piece-wise linear approximation of the target density 
is built. 

the output of the first run. That's explained in details in [32] and illustration Fig. 10. When dealing 
with multiple neurons data, Eq. 25 shows that the same approach can be immediately applied after the 
introduction of the configuration. 

7.2.3 Generation of the configuration 

This can be done (almost) exactly as for a "normal" Potts model [29, 35] and is left as an exercise to the 
reader (the answer can be found in [32]). 

7.2.4 Our complete MC step 

In Sec. 11 we will give an example of our algorithm at work with the following sequential MC step: 

T h x . . . x Ti N x T Pl l x ... x T fl x ... x T Pk i x . . . x T fx (50) 



8 Priors choice 

We will assume here that we know "little" a priori about our parameters values and that the joint prior 
density n pr i or (9) can be written as a product of the densities for each component of 9, that is: 

K 

ir P rior{9) = \\ n(f q ) ■ Tr(s g ) ■ ir(8 q ) ■ ir(\ q ) ■ 7r(P g ,i) ■ 7r(P g , 2 ) ■ 7r(P g , 3 ) • 7r(P 9 , 4 ) 

9=1 

where we are assuming that four recording sites have been used. We will further assume that our signal 
to noise ratio is not better 20 (a rather optimistic value), that our spikes are positive and therefore the 
7r(Pg,i...4) are null below and above +20 (remember we are working with normalized amplitudes). We 
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will reflect our absence of prior knowledge about the amplitudes by taking a uniform distribution between 
and +20. The A value reported by Fee et al [ ] is 45.5 s _1 . A must, moreover be smaller than oo, we 
adopt a prior density uniform between 10 and 200 s _1 . S must be < 1 (the amplitude of a spike from 
a given neuron on a given recording site does not change sign) and > (spikes do not become larger 
upon short ISI), we therefore use a uniform density between and 1 for S. An inspection of the effect of 
the shape parameter / on the ISI density is enough to convince an experienced neurophysiologist that 
empirical unimodal ISI densities from well isolated neurons will have / G [0.1, 2]. We therefore take a 
prior density uniform between 0.1 and 2 for /. The same considerations leads us to take a uniform prior 
density between 0.005 and 0.5 for s. 



9 The good use of the ergodic theorem. A warning. 

The ergodic theorem is the key theoretical result justifying the use of Monte Carlo integration to solve 
tough problems. When using these methods we should nevertheless be aware that the theorem applies 
only when the number of Monte Carlo steps of our algorithms go to infinity 12 and because such is never 
practically the case we will commit errors. We would commit errors even if our draws where directly 
generated from 7r post . The difference between the MCMC/Dynamic MC based estimates and estimates 
based on direct samples ("plain MC") is that the variance of the estimators of the former have to be 
corrected to take into account the correlation of the states generated by the Markov chain. We explain 
now how to do that in practice, for a theoretical justification see Sokal [35] and Janke [17]. 



9.1 Autocorrelation functions and confidence intervals 

We have to compute for each parameter, 6>j, of the model the normalized autocorrelation function (ACF), 

Pnorm(l;0i), defined by: 

N T -l 

P m) = N N i -Y, ( g i (t) - ft) • - ft) 

1 u t=N D 

Pnorm{l;0i) = ^ ^| ( 51 ) 

Where Nt is the total number of MC steps performed and No is the number of steps required to reach 
"equilibrium" (see Sec. 9.2, 11.2 & 11.3). Then we compute the integrated autocorrelation time, T autoco (9i): 



i ) = -+Y d p{l;e i ) (52) 



7~autc 

Z 

1=1 

where L is the lag at which p starts oscillating around 0. Using an empirical variance, a 2 (6i) of parameter 
8i, defined in the usual way: 

^ft) = ^ r -^-i £(ft (t) -ft) 2 < 53 > 

1 u t=N D 

where 9i is defined like in Eq. 34. Our estimate of the variance, Var of 9i becomes: 

Var\6 i } = xlTZn-i * 1 ^ (54) 

In the sequel, the confidence intervals on our parameters estimates are given by the square root of the 
above defined variance (Table 2 & 3). 

We can view the effect of the autocorrelation of the states of the chain as a reduction of our effective sample 
size by a factor: 2T a „ toco (0j). This gives us a first quantitative element on which different algorithms 
can be compared (remember that the MH algorithm gives us a lot of freedom on the choice of proposal 
transition kernels). It is clear that the faster the autocorrelation functions of the parameters fall to zero, 
the greater the statistical efficiency of the algorithm. The other quantitative element we want to consider 

12 This is not even true because our algorithm use pseudo - random - number generators which among many shortcomings have a 
finite period. See Chap 7 of [7], Chap 2 of [33] and [20]. 
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is the computational time, r cpu , required to perform one MC step of the algorithm. One could for instance 
imagine that a new sophisticated proposal transition kernel allows us to reduce the largest T autoco of our 
standard algorithm by a factor of 10, but at the expense of an increase of r cpu by a factor of 100. Globally 
the new algorithm would be 10 times less efficient than the original one. What we want to keep as small 
as possible is therefore the product: T autoco • r cpu . 



9.2 Initialization bias 



The second source of error in our (finite size) MC estimates is a bias induced by the state {9^\ C^) with 
which the chain is initialized [35, 7]. The bad news concerning this source of error is that there is no 
general theoretical result providing guidance on the way to handle it, but the booming activity in the 
Markov chain field already produced encouraging results in particular cases [ ]. The common wisdom in 
the field is to monitor parameters (and labels) evolution, and/or functions of them like the energy (Eq. 30). 
Based on examination of evolution plots (eg, Fig. 12 & 13) and/or on application of time-series analysis 
tools, the user will decide that "equilibrium" has been reached and discard the parameters values before 
equilibrium. More sophisticated tests do exist [ ] but they wont be used in this chapter. These first two 
sources of error, finite sample size and initialization bias, are clearly common to all MCMC approaches. 



10 Slow relaxation and the Replica Exchange Method 



A third source of error appears only when the energy function exhibits several local minima. In the 
latter case, the Markov chain realization can get trapped in a local minimum which could be a poor 
representation of the whole energy function. This sensitivity to local minima arises from the local nature 
of the transitions generated by the MH algorithm. That is, if we use a sequential scheme like Eq. 50, at 
each MC time step, we first attempt to change the label of spike 1, then the one of spike 2, then the one 
of spike N, then we try to change the first component of 6 (Pi,i), and so on until the last component {a K )- 
That implies that if we start in a local minimum and if we need to change, say, the labels of 10 spikes 
to reach another lower local minimum, we could have a situation in which the first 3 label changes are 
energetically unfavorable (giving, for instance, an acceptance probability, Eq. 46, of 0.1 per label change) 
which would make the probability to accept the succession of changes very low ( 0.1 3 )... meaning that our 
Markov chain would spend a long time in the initial local minimum before "escaping" to the neighboring 
one. Stated more quantitatively, the average time the chain will take to escape from a local minimum 
with energy E min grows as the exponential of the energy difference between the energy, E* , of the highest 
energy state the chain has to go through to escape and E min : 

Tescape « exp [/? (E* - E mm )] 



Our chains will therefore exhibit an Arrhenius behavior. To sample more efficiently such state spaces, 
the Replica Exchange Method (REM) [15, 26], also known as the Parallel Tempering Method [12, 37, 5], 
considers R replicas of the system with an increasing sequence of temperatures (or a decreasing sequence 
of 0) and a dynamic defined by two types of transitions : usual MH transitions performed independently 
on each replica according to the rule defined by Eq. 50 and a replica exchange transition. The key idea 
is that the high temperature (low (3) replicas will be able to easily cross the energy barriers separating 
local minima (in the example above, if we had a probability of 0.1 3 to accept a sequence of labels switch 
for the replica at ft = 1, the replica at = 0.2 will have a probability 0.1 3 ' 2 w 0.25 to accept the same 
sequence). What is needed is a way to generate replica exchange transitions such that the replica at (3 
= 1 generates a sample from n post defined by Eq. 31. Formally the REM consists in simulating, on an 
"extended ensemble", a Markov chain whose unique stationary density is given by: 

7r ee (01, Ci, . . . , Or, Cr) = KpostSx Cl) • ■ ■ Kpost,0 B . Cr) (55) 



where "ee" in n ee stands for "extended ensemble" [16], R is the number of simulated replicas, (3\ > . . . > (3r 
for convenience and: 

cxp [-frE^d)] 

TTpost.ft (O l ,C l ) = Z(f3 ) 

That is, compared to Eq. 31, we now explicitly allow f3 to be different from 1. To construct our "complete" 
transition kernel we apply our previous procedure. That is, we construct it as a sequence of parameter, 
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label and inter-replica specific MH transitions. We already know how to get the parameter and label 
specific transitions for each replica. What we really need is a transition to exchange replicas, say the 
replicas at inverse temperature ft and ft+x, such that the detailed balance is preserved (Sec. 7): 

TTee (01, Cl, • • • j @ii C, 0i+l, Cj+i, . . . , Or, Cr) 7i,i+X (#i+X, C+x, @i, Ci \ 9i, Ci, 6*1+1, C+i) = 
TTee ($1, Cl, ■ ■ • , 6*i+x, C+x, Ci, . . . , Or, Cr) Tj^+x (6>j, Ci, 6*i+x, C+i | 9i + i, C+x, 6*i, C) 

which leads to: 

Ti,i+i (Oi, Ci, 9i + i, C+i | 6*i+i, C+i) 6*i, C) 7r ee (6*x, Cx, ■ ■ ■ ,&i, Ci, 6i + i, C+x, ■ ■ ■ , Or, Cr) 

(di+l, C+1, Oi, Ci | 6*i, Ci, 9i+\, Ci+x) TTee (6*x, Cx, • . • , Oi+i, Ci+x, 6*i, Ci, . . . , Or, Cr) 

Impost, @i 

(Oi,Ci) ■ ^post,/3 i+1 (6*1+1, Ci+x) 
7Tpost,ft (6*i+X, Ci+x) • (Oi,Ci) 

= exp [- 08, - ft+x) ■{E{9 il C i )-E (6 i+1 , C+x))] 

Again we write T l l+1 as a product of a proposal transition kernel and an acceptance probability. Here we 
have already explicitly chosen a deterministic proposal (we only propose transitions between replicas at 
neighboring inverse temperatures) which gives us: 

Ai,i+i (Oi, d, 0j+x, C+x | l+1 , C l+1 , Oi, d) 

= exp [- (ft - ft+i) ■ (E (Oi, d) - E (0 l+1 , C+x))] 

-^iji+l ("i+l, L'i+l, »»! I "ij ^ij "i+l, ^i+l) 

It is therefore enough to take: 

A U i +1 (0i,Ci,0 l+1 ,Ci +1 | 0i +1 ,Ci +1 ,0i, Ci) = min{l, exp [- (ft - ft +i ) ■ (B (^, C) - £? (6 i+1 , C +1 ))]} (57) 

The reader sees that if the state of the "hot" replica (ft+i) has a lower energy (E (Oi, d)) than the "cold" 
one, the proposed exchange is always accepted. The exchange can pictorially be seen as cooling down the 
hot replica and warming up the cold one. Fundamentally this process amounts to make the replica which 
is at the beginning and at the end of the replica exchange transition at the cold temperature to jump from 
one local minimum (0 i+ i, C +1 ) to another one (6i,Ci). That is precisely what we were looking for. The 
fact that we can as well accept unfavorable exchanges (i.e., raising the energy of the "cold" replica and 
decreasing the one of the "hot" replica) is the price we have to pay for our algorithm to generate samples 
from the proper posterior (we are not doing optimization here). 

In order for the replica exchange transition to work well we need to be careful with our choice of inverse 
temperatures. The typical energy of a replica (i.e., its expected energy) increases when P decreases (Fig. 
16A). We will therefore typically have a positive energy difference: AE = E hot — E co i d > between the 
replicas at low and high P before the exchange. That implies that the acceptance ratio (Eq. 57) for 
the replica exchange will be typically smaller than 1. Obviously, if it becomes too small, exchanges will 
practically never be accepted. To avoid this situation we need to choose our inverse temperatures such 
that the typical product: A/? • AE, where A/3 = p co u — Phot, is close enough to zero [28, 15, 16]. In practice 
we used pre-runs with an a priori too large number of Ps, checked the resulting energy histograms and 
kept enough inverse temperatures to have some overlap between successive histograms (Fig. 16B). 

In Sec. 11 we will perform replica exchange transitions between each pair ft, ft +1 with an even, respec- 
tively odd, i at the end of each even, respectively odd, MC step. With this replica exchange scheme, each 
MC time step will therefore be composed of a complete parameter and label transition for each replica, 
followed by a replica exchange transition. This scheme corresponds to the one described by Hukushima 
and Nemoto [ ]. A rather detailed account of the REM can be found in Mitsutake et al [26]. Variations 
on this scheme do exist [28, 4]. 



11 An Example from a simulated data set 

We will illustrate the performances of the algorithm with a simple simulated data set. The data are 
supposed to come from 3 neurons which are exactly described by our underlying model. Such an illus- 
tration has in our opinion several advantages. Being simple it helps the reader to concentrate on the 
inner working of the algorithm. Because the data correspond to the underlying model hypothesis, our 
implementation of the MCMC method should give us back the parameters used to simulate the data, 
we are therefore performing here a simple and necessary test of our code. The data do moreover exhibit 
features (strong cluster overlap, Fig. 11B,C) which would make them unusable by other algorithms. A 
presentation of the algorithm performances with a much worse data set can be found in [32]. 
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neuron 1 



neuron 2 



neuron 3 



S 
A 



15,9 
0.7 

33.33 
25 
0.5 

28.3 

1060 



8,8 
0.8 
40 
30 
0.4 
32.5 
923 



6,12 
0.6 
50 
18 
1.0 
29.7 
983 



s 



/ 

(isi) 
# 



Table 1: Parameters used to simulate the neurons. The maximal peak amplitude values (Pi) are given in 
units of noise SD. The scale parameters (s) and mean isi ((isi)) are given in ms. The bottom row indicates 
the number of events from each neuron. The correspondence between neuron number and color on Fig. 
11: 1, green, 2, red, 3, blue. 

11.1 Data properties 

The parameters used to simulate the three neurons are given in Table 1. 30 seconds of data were sim- 
ulated giving a total of 2966 spikes. The raw data, spike amplitude vs time on the two recording sites 
are illustrated on Fig. 11A1 & 11A2. Fig. 1 IB is a "Wilson plot" of the entire sample. Notice the strong 
overlap of points (spikes) arising from the 3 different neurons. Fig. 11C shows the theoretical iso-density 
contours for the clusters generated by each of the 3 neurons. Neuron 1 in Table 1 is green on the figure, 
neuron 2 is red and neuron 3 is blue. The reader can see that roughly 50% of the spikes generated by neu- 
ron 2 should fall in regions were neuron 1 or neuron 3 will also generate spikes. The theoretical densities 
associated with each neuron (cluster) are not 2-dimensional Gaussian. This can be most clearly seen for 
neuron 3 (blue iso-density contours) which has a "flat" summit. None of these densities is symmetrical 
with respect to its maximum along its principal axis (which is the axis going through the graph's origin 
and the neuron density maximum). Fig. 11D1 represents the amplitude dynamics of each of the three 
neurons, while Fig. 11D2 displays their respective ISI densities. 

11.2 Algorithm dynamics and parameters estimates without the REM 

11.2.1 Initialization 

The algorithm being iterative we have to start it somewhere and we will use here a somewhat "brute 
force" initialization. We choose randomly with a uniform probability -k as many actual events as neurons 
in the model (K=3). That gives us our initial guesses for the P qii . 5 is set to 5 mm = for each neuron. All 
the other parameters are randomly drawn from their prior distribution (Sec. 8). The initial configuration 
is generated by labeling each individual spike with one of the K possible labels with a probability l/K for 
each label (this is the (3 = initial condition used in statistical physics). 

11.2.2 Energy evolution 

At each MC step, a new label is proposed and accepted or rejected for each spike and a new value is 
proposed and accepted or rejected for each of the (18) model parameters (Eq. 50). Our initial state is very 
likely to have a very low posterior probability. We therefore expect our system to take some time to relax 
from the highly disordered state in which we have initialized it to the (hopefully) ordered typical states. 
As explained in Sec. 9.2, we can (must) monitor the evolution of "system features". The most reliable 
we have found until now is the system's energy as shown on Fig. 12. Here one sees an "early" evolution 
(first 60000 MC steps) followed by a much slower one which looks like a stationary regime during the last 
100000 steps 13 . 

13 The time required to perform such simulations depends on the number of sampling points used for the piece-wise linear pro- 
posals of the amplitude parameters (Sec. 7.2.2). During the first 2000 steps we use 101 regularly spaced sampling points. We then 
need less than 5' to perform 1000 steps. After that, we reposition the sampling points and use only 13 of them and the time required 
to perform 1000 steps falls to 50 s (Pentium IV computer at 3.06 GHz). 
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Figure 11: Simulated data with 3 active neurons recorded on a stereode (2 recording sites). 30 s of data 
were simulated resulting in 2966 spikes. A, An epoch with 100 spikes on site 1 (Al) and 2 (A2). Vertical 
scale bar in units of noise SD. B, Wilson plot of the 2966 spikes: the amplitude of each spike on site 2 is 
plotted against its amplitude on site 1. The scale bars corner is at the origin (0,0). C, Expected iso-density 
plots for each of the three neurons: neuron 1 (green), neuron 2 (red) and neuron 3 (blue). Dl, Normalized 
spike amplitude vs ISI for each neuron (normalization done with respect to the maximal amplitude). 
Horizontal scale, same as D2. D2, ISI density of each of the 3 neurons (peak value of blue density: 35 Hz). 



Energy evolution without the REM 
100-1 




Figure 12: Energy evolution without the REM. Notice the step like decreases in the early part. 
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Figure 13: Evolution of the maximal peak amplitude on the two recording sites for the three neurons. 
Notice that an apparently stationary regime is reached quickly for those parameters, while the energy 
(Fig. 12) is still slowly decreasing. 



11.2.3 Model parameters evolution 

We can monitor as well the evolution of models parameters like the maximal peak amplitudes of the neu- 
rons 14 as shown on Fig. 13. The interesting, and typical, observation here is that the model parameters 
reach (apparent) stationarity much earlier than the energy (compare the first 60000 steps on Fig. 12 & 
13). That means that most of the slow relaxation of the energy has a configurational origin. In other 
words, it comes from spike labels re-arrangements. 



11.2.4 Model parameters estimates 

Based on the energy trace (Fig. 12) and on the evolution of the model parameters (e.g., Fig. 13) we 
could reasonably decide to keep the last 100000 MC steps for "measurements" and estimate the posterior 
(marginal) densities of our model parameters from these steps. We would then get for neuron 2 what's 
shown on Fig. 14. The striking feature of this figure is the time taken by the ACFs of the amplitude 
parameters (Pi, P 2 , 6, A) to reach 0: 4000 steps. With the present data set we observe a similar behavior 
for the amplitude parameters of the first neuron but not for the third. That explains the un-acceptably 
low precision on the parameters estimates reported in Table 2 for neurons 1 & 2. 



11.3 Algorithm dynamics and parameters estimates with the REM 

The different behaviors of the model parameters, which relax to (apparent) equilibrium relatively fast 
(2 • 10 4 to 3 • 10 4 MC steps) and of the energy which relaxes more slowly (6 ■ 10 4 MC steps) suggests 
that the configuration dynamics is slower than the model parameters dynamics. It could therefore be 
worth considering a modification of our basic algorithm which could speed up the configuration (and if 
we are lucky the model parameters) dynamics like the Replica Exchange Method (Sec. 10). To illus- 
trate the effect of the REM on our simple data set we have taken the first 2000 MC steps of the pre- 
vious section and restarted 15 the algorithm augmenting our inverse temperature space from (3 e {1} to 
f3 G {1,0.95,0.9,0.85,0.8,0.75,0.7,0.65,0.6,0.55,0.5}. The resulting energy trace for /? = 1 and 30000 ad- 
ditional steps is shown on Fig. 15. The striking feature here is that the energy reaches after roughly 
1.2 • 10 4 steps a lower energy than after 30 • 10 4 steps without the REM. Our REM scheme with 11 inverse 

14 We monitor in practice the evolution of every model parameter. We show here only the evolution of the maximal peak amplitude 
to keep this chapter length within bounds. 

15 When we introduce new /3 values from one run to the next, the initial state of the replica with the new (3 values is set to the 
final state of the replica with the closest (3 in the previous run. That is, in our example, at step 2001 all the replicas are in the same 
state (same parameters values, same configuration). 
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Figure 14: Marginal posterior estimates for the parameters of neuron 2. The left part shown the estimated 
density of each of the six parameters (for the peak amplitudes, the first site corresponds to the blue curve 
and the second to the green curve). The right side shows the corresponding ACFs (Sec. 9.1 and Eq. 51), 
the abscissa is the lag in MC steps. 
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(32) 
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33 ±64 


39 ± 137 


47 ± 13 




(1250) 


(1043) 


(45) 


s 


24.9 ±0.1 


29.8 ±0.1 


18.3 ±0.5 




(6.5) 


(7) 


(1) 


1 


0.51 ±0.05 


0.40 ±0.05 


1.01 ±0.03 




(6.5) 


(7) 


(1) 



Table 2: Estimated parameters values using the last 10 5 MC steps (among 3 • 10 5 ). The SDs are autocor- 
relation corrected (Sec. 9.1). The integrated autocorrelation time (Eq. 52) is given below each parameter 
between "()". 
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Energy evolution with the REM 




Figure 15: Energy evolution with the REM. The ordinate scale is the same as in Fig. 12. 

temperatures implies that each step requires 11 times more CPU time than one step with a single inverse 
temperature 16 . From the energy evolution view point the REM is clearly more efficient than a single long 
simulation at /3 = 1. A more striking illustration of the REM effect on energy relaxation (on an other data 
set) can be found in [32]. 

11.3.1 Making sure the REM "works" 

As explained in Sec. 10 the REM will work efficiently only if the energies explored by replicas at ad- 
jacent exhibit enough overlap. Fig. 16 illustrates how the efficiency of the "replica exchange" can be 
empirically assessed [12, 37, 5, 26]. Energy traces at different f3 values are displayed on Fig.l6A. The 
overlap between adjacent energy traces is already clear. The lowest trace (j3 = 1) is the trace of Fig. 15. 
Fig. 16B shows the energy histograms for the different f3. Adjacent histograms clearly exhibit a signifi- 
cant overlap. A pictorial way to think of the REM is to imagine that several "boxes" at different pre-set 
inverse-temperatures are used and that there is one and only one replica per box at each step. After each 
MC step, the algorithm proposes to exchange the replicas located in neighboring boxes (neighboring in 
the sense of their inverse-temperatures) and this proposition can be accepted or rejected (Eq. 57). Then 
if the exchange dynamics works properly one should see each replica visit all the boxes during the MC 
run. More precisely each replica should perform a random walk on the available boxes. Fig. 16C shows 
the random walk performed by the replica which start at /3 = 1 at step 2001. The ordinate corresponds to 
the box index (see legend). Between steps 2001 and 32000, the replica travels through the entire inverse 
temperature range. 

One of the shortcomings of the REM is that it requires more f3 to be used (for a given range) as the number 
of spikes in the data set increases because the width of the energy histograms (Fig. 16B) is inversely 
proportional to the square root of the number N of events in the sample [15, 12, 16]. The necessary 
number of f3 grows therefore as y/N. The computation time per replica grows, moreover, linearly with N. 
We therefore end up with a computation time of the REM growing like: N 1 - 5 . 

11.3.2 Posterior estimates with the REM 

We do not present here the Equivalent of Fig. 13 with the REM because the two figures would look too 
similar. The best way to see the effect of the REM on the model parameters dynamics is to look at the 
empirical integrated autocorrelation time (IAT) for each parameter as show in Table 3. When we compare 
these values with the ones obtained with our basic algorithm (without the REM), we see for instance 
that longest IAT is now 110 steps (parameter A of the second neuron) while the longest IAT without the 

16 The computational overhead required accept or reject the proposed replica exchanges (Eq. 57) is negligible compared to the time 
required to update every spike label and every model parameter. 
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Figure 16: Test of the REM. A, Energy evolution for several f3 values (see text) during successive runs. 
1 replica is used between steps 1 and 2000, 11 replicas are used between steps 2001 and 32000. B, 
Energy histograms at each /3 value (histograms computed from the 10000 last steps with 25 bins): the 
left histogram correspond to j3 = 1, the right one to j3 = 0.5. C, Path of the first replica in the temperature 
space. The ordinate corresponds to the /? index (1 corresponds to /? = 1 and 11 to (3 = 0.5 ). 



REM was 1250 steps (parameter A of the first neuron, Table 2). We therefore get much better parameters 
estimates although the absolute sample size we use 10 4 is 10 times smaller than the one used previously. 
This means that in this simple setting we were able without any fine tunning (we could have in fact 
used fewer inverse temperatures, we have not optimized the number of steps between replica exchange 
attempts) to get a much more efficient algorithm (in term of statistical efficiency and of relaxation to 
equilibrium) without extra computational cost. 



11.3.3 Configuration estimate 

As a quick way to compare estimated and actual configurations we can introduce what we will (abusively) 
call the "most likely" configuration estimate. First, we estimate the probability for each spike to originate 
from each neuron. For instance, if we discard the first Nd = 22000 on a total of Nt = 32000 steps we have 





neuron 1 


neuron 2 


neuron 3 




15 ±2 


8±2 


6.1 ± 0.1 




(62) 


(91) 


(12) 


~P~2 


9± 1 


8±2 


12.1 ±0.3 




(55) 


(93) 


(14) 




0.70 ±0.07 


0.8 ±0.3 


0.58 ±0.02 




(25) 


(93) 


(9) 


A 


34 ± 19 


40 ±38 


46 ±7 




(60) 


(110) 


(20) 


s 


24.9 ±0.6 


30.0 ±0.4 


18.3 ±0.7 




(2) 


(1.5) 


(2.5) 


1 


0.51 ±0.02 


0.40 ±0.01 


1.01 ±0.03 




(2) 


(1.5) 


(2.5) 



Table 3: Estimated parameters values using the last 10 4 MC steps (among 6 • 10 4 ). The SDs are autocor- 
relation corrected (Sec. 9.1). The integrated autocorrelation time (Eq. 52) is given below each parameter 
between "()". 
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Figure 17: Comparison between the actual and estimated configurations. Top, the Wilson plot of Fig. 11B 
color coded to show the neuron of origin of each spike. Bottom left, the most likely configuration (see text) 
estimated from the last 10000 steps with the REM. Bottom right, the 50 errors (among 2966 events). 

for the probability of the 100th spike to have been generated by the second neuron: 



where I q is the indicator function defined Eq. 16. 

We then "force" the spike label to its most likely value. That is, if we have: Pr(l wo = 1 | Y) = 0.05, 
Pr(hoa = 2 | Y) = 0.95 and Pr(l W Q = 3 | Y) = 0, we force Zi o to 2. We proceed in that way with each 
spike to get our "most likely" configuration. Fig. 17 shows what we get using the last 10 • 10 3 MC steps 
of our run using the REM. The actual configuration is shown too, as well as the 50 misclassified events. 
We therefore get 1.7% misclassified spikes with our procedure. We leave as an excercise to the reader to 
check that using an exact knowledge of the model parameters and the theoretical 2-dimensional Wilson 
plot densities one can get from them (Fig. HQ we would get roughly 9% of the spikes misclassified. That 
would be the optimal classification we could generate using only the amplitude information. Our capacity 
to go beyond these 9% clearly stems from our inclusion of the ISI density in our data generation model. 
Using the last 10 • 10 3 MC steps of our run without the REM we generate 63 misclassifications, while we 
get "only" 59 of them with the last 100 • 10 3 steps 17 . 

11.3.4 A more detailed illustration of the REM dynamics. 

When the posterior density one wants to explore is non isotropic a problem arise with the use of an MH 
algorithm based on component specific transitions. In such cases, the transitions proposed by the MH 
algorithm are not optimal in the sense that only "small" changes of the present parameters values are 
likely to be accepted. These "small" changes will lead to a "slow" exploration of the posterior density and 
therefore to "long" autocorrelation times. This is illustrated for our MH algorithm without the REM on the 
left side of Fig. 18. Here we show only the situation in the plane defined by the maximal peak amplitude 
on site 1 and the parameter A of neuron 1 (i-2,ij but the reader should imagine that the same holds in 
the 4 dimensional spaces defined by the 4 amplitude parameters of each neurons. The ACFs (Eq. 51) of 
the two parameters are shown. The last 1000 values of the parameters generated by the MH algorithm 
without the REM with the last 50 of them linked together by the broken line are shown. The movement of 
the "system" in its state space has clear Brownian motion features: small steps, random direction change 
from one step to the next, resulting in a "slow" exploration of the posterior density. If we now look at the 

17 This difference in the number of misclassification explains the energy difference observed in the two runs (Fig. 12 & 15). It 
means clearly that the run without the REM has not reached equilibrium. 
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Figure 18: REM and parameters auto-correlation functions. The ACFs of the maximal peak amplitude of 
neuron 2 on site 1 (Pa,i ) are shown together with the ACFs of the parameter A of neuron 2. The ACFs 
without the REM (left side) were computed from the 100000 last iterations (Fig. 12). The ACFs with the 
REM (right side) were computed from the 10000 last iterations (Fig. 15). The ACFs with REM clearly fall 
faster than the ones without. The two graphs at the bottom give a qualitative explanation of why it is so. 
On each graph the last 1000 generated values for P 2 ,i (abscissa) and Ai (ordinate) are shown (black dots). 
50 values generated successively are linked together by the broken lines (green). The size of the jumps is 
clearly larger with REM than without. 



system's dynamics using the REM, we see that the ACFs fall to zero much faster (right side of Fig. 18). 
This is easily understood by observing successive steps of the systems in the corresponding plane (A vs 
P 2 ,i graph, bottom). Large steps going almost from one end to the other of the marginal posterior density 
are now observed, meaning that the resulting Markov chain explores very efficiently the posterior density 
we want to study. This is clearly due to the fact that the high temperature replicas can make much larger 
steps with a still reasonably high acceptance probability. The replica exchange dynamics (cooling down 
"hot" replicas and warming up "cold" ones) does the rest of the job. This increased statistical efficiency 
(Tautoco gets smaller) does in fact compensate for the increased computational cost (r cpu gets larger) of the 
REM (compare the integrated autocorrelation times in Table 2 & 3). 



12 Conclusions 

We have given a detailed presentation of a simple data generation model for extracellular recording. 
Although simple this model is already much more sophisticated than the ones previously used [21, 34, 31, 
30], but this sophistication comes at a price: the usual algorithms presented in the spike-sorting literature 
cannot be used. To work with our model we had to find an analogy between our problem and the Potts 
spin glass model studied in statistical physics. This analogy allowed us to get a Dynamic MC/MCMC 
algorithm with which model parameters and configurations could be estimated with confidence intervals. 

We have not explained here how one of the critical questions of spike-sorting, the number of neuron 
contributing to the data, can be tackled. We have recently described in [32] how this can be done with an 
ad-hoc method. The next step is clearly a Bayesian solution to this problem which fundamentally implies 
estimating the normalizing constant Z of Eq. 26 and 27 (which is the probability of the data). Nguyen et 
al [30] have already given a solution for a simple data generation model. This task of estimating Z could 
seem daunting but is luckily not. Dynamic MC and MCMC methods have been used for a long time which 
means that other people in other fields already had that problem. Among the proposed solutions, what 
physicists call thermodynamic integration [19, 8] seems very attractive because it is known to be robust 
and it requires simulations to be performed between p = 1 and [3 = 0. And that's precisely what we are 
already (partially) doing with the REM. Of course the estimation of Z is only half of the problem. The 
prior distribution on the number of neurons in the model has to be set properly as well. Increasing the 



35 



number of neurons will always lead to a decrease in energy which will give larger Z values (the derivative 
of the log of Z being proportional to the opposite of the energy [ ]), we will therefore have to compensate 
for this systematic Z increase with the prior distribution. 

Finally, although our method requires fairly long computation to be carried out, we hope we have con- 
vinced our reader that the presence of strong overlaps on Wilson plots does not necessarily precludes good 
sorting. 



13 Exercises solutions 

13.1 Cholesky decomposition (or factorization) 

Write dij the elements of A and a^j the elements of T, compute: 

AA T = B 

with: b ly j = J2k=i a i,ka,j,k and identify individual terms, that is: 

0-1,1 = y/^Xl, 

where the fact that A is lower triangular has been used. 

a,j,i = ^ , 2 < j < D. 

Ol,l 



02,2 — \ C2,2 ~ a 2 1- 



And keep going like that and you will find: 



for i = 1 to D 
for j = i . + 1 to D 



Ei—l 
k=l a j,kO-i,k 

Oj A 



For details see [1]. 



13.2 Bayesian posterior densities for the parameters of a log-Normal pdf 

Answer to question 1 We have: 



7r (s \ f,T>) a exp 



1 1 



N 



2 f 2 ^ 

3 3=1 



Y^Qnij - Ins)' 



If we introduce Ini denned by Eq. 17 in the summed term we get: 

N N 

(In ij — In s) 2 = (in ij — In i + In i — In s) 

3=1 j=i 

N N 

= (inij - ml) + 2 (Jni - In s) • ^ (in ij — hn) + N (Jni - In s) 

3=1 3=1 

The first term does not depend on s, the second is zero, we therefore have: 



7r (s | /, T>) a exp 



1 N ir~ i ^ 

--— (In i- Ins) 



1 
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Answer to question 2 

1. Generated ~ Norm (lni, 



18 



2. if s min < expu < s m ax, then 5 = expw 
otherwise, go back to 1 



Answer to question 3 Using the first line of Answer 1 we get: 



tt(/ I s,V) a-jj exp 



Identification does the rest of the job. 



l EjLi ( ln »j - Ins) 2 
2 /2 



1 



frnax fn 



(/) 



Answer to question 4 Using = ^ and the Jacobian of the change of variable, you can easily show 
that if fl ~ Gamma (a, /3) then 6 = i ~ /ra — Gamma (a, /?). The algorithm is simply: 



1. Generate ~ Gamma — 1) | J2f=i (hi 



ij — Ins) 



2. if / m i„ < y i < /maz , then F = y' i 
otherwise go back to 1. 

13.3 Stochastic property of a Markov matrix 

jSf(t+i) being a n; defined on X we must have by definition: 

Pr (X< t+1 ) C #) =1, 



where c A" should be read as: X^ t+1 ^ = x\ + x 2 

exclusive events). We have therefore: 



x v (remember that the X{ are mutually 



i=l 

= £ Pr = ^ | C AfJ Pr f Jf <*) C A" 

i=l 

= Y.Y.Pr (x (t+1) = Xl | X® = sj) Pr (iW = x 

j^Pr (x {t+l ^ = Xl | = 



]TPr (X« = 

3=1 



Which implies: 



£Pr(x 



3=1 



^(t+1) = Xj | X (t) 



i=l 



And that must be true for any X^ which implies that 19 : 

^Pr(x(*+« = Xi \X® = Xj ) =1 



i=l 



18 This notation means that the random variable U has a normal distribution with mean: In i and variance: ^ . 

19 If you're not convinced you can do it by absurdum assuming it's not the case and then find a X^'which violate the equality. 
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13.4 Detailed balance 

If Vx, ye X, tt(x)T (x, y) = ir (y) T (y, x) then: 

^2 n(x)T(x,y) = n (y) T (y, x) 

x ex x ex 

= n{y) ^2 T (v> x ) 

xex 

= Ti" (y) 

Remember that T is stochastic. 
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